fix: enable multi-GPU DDP training in Jupyter notebooks#928

Merged
Borda merged 35 commits into roboflow:develop from mfazrinizar:fix/ddp-notebook-cuda-init
Apr 8, 2026

Conversation

@mfazrinizar
Contributor

What does this PR do?

Fixes multi-GPU DDP training (strategy="ddp_notebook" and strategy="ddp_spawn") which was completely broken in Jupyter (e.g. Kaggle) notebook environments. The fix addresses two layers of issues:

  1. CUDA early initialization: RFDETRBase() eagerly moved the model to CUDA during __init__(), and module-level torch.cuda.is_available() in config.py created a CUDA driver context at import time, making multi-process training impossible.

  2. OpenMP thread pool corruption after fork: Even after fixing CUDA init, PyTorch's OpenMP thread pool (created during model construction) cannot survive fork(). The worker threads become zombie handles, causing SIGABRT: Invalid thread pool! when the autograd engine initializes in forked children. Fixed by transparently replacing fork-based DDP with a spawn-based strategy.

Related Issue(s): Fixes #923

Type of Change

  • Bug fix (non-breaking change that fixes an issue)

Testing

  • I have tested this change locally
  • I have added/updated tests for this change

Test details:

Unit tests (101 pass locally)

  • test_build_trainer.py: 52 tests covering precision resolution, strategy selection, ddp_notebook→spawn mapping, EMA guards, logger wiring
  • test_module_data.py: 49 tests including test_ddp_notebook_preserves_num_workers and test_other_strategy_preserves_num_workers

Integration test (Kaggle T4 x2)

Validated on Kaggle GPU T4 x2 accelerator (Python 3.12, PyTorch 2.10.0+cu128, PTL 2.6.1):

Test                                                 Result    Time
CUDA not initialized after RFDETRBase()              ✅ PASS
Model weights on CPU after construction              ✅ PASS
strategy="ddp_notebook" training (3 epochs, 2×T4)    ✅ PASS   84.3s
strategy="ddp_spawn" training (3 epochs, 2×T4)       ✅ PASS   77.4s
Inference after DDP training                         ✅ PASS

What This Fixes

Scenario                                                      Before                                        After
model.train(devices=2, strategy="ddp_notebook") in notebook   ❌ CUDA re-init / SIGABRT                     ✅ Works
model.train(devices=2, strategy="ddp_spawn") in notebook      ❌ CUDA re-init / MisconfigurationException   ✅ Works
model.train(devices=1)                                        ✅ Works                                      ✅ Works (no regression)
model.predict(img)                                            ✅ Works                                      ✅ Works (lazy device placement)
model.train() → model.predict(img)                            ✅ Works                                      ✅ Works
model.export_onnx() / model.optimize_for_inference()          ✅ Works                                      ✅ Works

Checklist

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code where necessary, particularly in hard-to-understand areas
  • My changes generate no new warnings or errors
  • I have updated the documentation accordingly (if applicable)

Additional Context

The ddp_notebook → spawn conversion is transparent to users: they continue passing strategy="ddp_notebook" (or strategy="ddp_spawn") and training just works. An INFO log message is emitted:

[INFO] rf-detr - ddp_notebook → spawn-based DDP to avoid OpenMP thread pool corruption after fork.

The find_unused_parameters=True flag is required because RF-DETR's architecture has parameters in the detection head that may not contribute to every loss term (e.g. encoder-only auxiliary losses).
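A minimal sketch of the DDP-related kwargs this implies (the helper name and exact shape are assumptions for illustration; the real PR passes these through Lightning's DDPStrategy, and find_unused_parameters is a standard torch DDP option):

```python
# Hypothetical helper, not the actual rf-detr API: collects the two DDP
# settings this PR's description calls out.

def notebook_ddp_kwargs() -> dict:
    return {
        "start_method": "spawn",         # fork is unsafe once the OMP pool exists
        "find_unused_parameters": True,  # some head params may skip a loss term
    }
```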

Technical Details

Two layers of CUDA initialization that had to be fixed

  1. Module-level (config.py): torch.cuda.is_available() creates a CUDA driver context at import time. Fixed with torch.accelerator.current_accelerator() which queries NVML without creating a primary context.

  2. Model construction (inference.py): nn_model.to("cuda") fully initializes the CUDA runtime. Fixed by keeping the model on CPU and deferring .to(device) to first predict()/export()/batch_size="auto" call via _ensure_model_on_device().

Why spawn instead of fork

PyTorch creates an OpenMP thread pool (default 8 threads) during the first tensor operation (here, model construction). fork() copies only the calling thread, so the OMP worker threads become zombie handles in the child. When the autograd engine in a forked child calls set_num_threads during thread_init, the OMP runtime finds an invalid pool state and aborts:

terminate called after throwing an instance of 'c10::Error'
  what(): pool INTERNAL ASSERT FAILED at "/pytorch/aten/src/ATen/ParallelOpenMP.cpp":64

This is a fundamental fork+OMP incompatibility; as far as I know, there is no library-level workaround. The fix transparently replaces fork-based ddp_notebook with a spawn-based _NotebookSpawnDDPStrategy whose launcher is marked is_interactive_compatible = True, allowing PTL to accept it in notebook environments.

Performance impact

  • First predict() call: ~50-200ms one-time latency from the CPU→GPU model transfer. The cost is strictly one-time: _ensure_model_on_device() checks first_param.device != target and becomes a no-op once the model is on GPU. After train(), the PTL-trained model is already on CUDA (synced at line 548), so even the first post-training predict() has zero transfer cost.
  • Subsequent predict() calls: Zero overhead (single next(parameters()).device comparison)
  • Production inference (RFDETRBase() → predict() without training): The one-time transfer happens on the very first call only. All subsequent calls, including batch evaluation loops, are zero-overhead.
  • Training: Zero impact (PTL builds its own model on CPU and handles device placement)
  • DDP spawn vs fork: ~12s additional startup for process spawn (one-time per training run)

@codecov

codecov bot commented Apr 7, 2026

Codecov Report

❌ Patch coverage is 89.83051% with 6 lines in your changes missing coverage. Please review.
✅ Project coverage is 79%. Comparing base (3f3bab3) to head (b4e82e4).
⚠️ Report is 1 commit behind head on develop.

Additional details and impacted files
@@          Coverage Diff           @@
##           develop   #928   +/-   ##
======================================
  Coverage       79%    79%           
======================================
  Files           97     97           
  Lines         7793   7846   +53     
======================================
+ Hits          6148   6195   +47     
- Misses        1645   1651    +6     

@Borda Borda added the bug Something isn't working label Apr 8, 2026
@Borda Borda requested a review from Copilot April 8, 2026 16:41
Contributor

Copilot AI left a comment

Pull request overview

Fixes multi-GPU DDP training in interactive notebook environments by preventing early CUDA initialization and by transparently switching notebook DDP strategies away from fork to a spawn-based launcher/strategy.

Changes:

  • Add a notebook-safe spawn-based DDPStrategy replacement for ddp_notebook / ddp_spawn in the trainer factory.
  • Defer inference-model .to(device) until first use via a new lazy device-placement helper.
  • Replace direct torch.cuda.is_available() checks with a device constant intended to avoid CUDA context creation at import time, and update tests accordingly.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 5 comments.

File                                    Description
src/rfdetr/config.py                    Introduces _detect_device() / DEVICE to avoid CUDA runtime init at import time.
src/rfdetr/inference.py                 Stops eager model .to(device) during model-context construction to prevent early CUDA init.
src/rfdetr/detr.py                      Adds _ensure_model_on_device() and calls it from inference/export/optimize/auto-batch paths.
src/rfdetr/training/trainer.py          Maps ddp_notebook/ddp_spawn to a spawn-based, interactive-compatible DDP strategy.
src/rfdetr/training/module_model.py     Uses config.DEVICE for compile gating instead of torch.cuda.is_available().
src/rfdetr/training/module_data.py      Uses config.DEVICE for pin_memory decisions; preserves configured num_workers.
tests/training/test_build_trainer.py    Adds coverage for spawn-based DDP mapping and ddp_notebook precision probing.
tests/training/test_module_data.py      Adds tests asserting num_workers/prefetch_factor preservation for strategies.
tests/training/test_module_model.py     Updates compile test to patch config.DEVICE instead of torch.cuda.is_available().

Borda and others added 6 commits April 8, 2026 19:09
- Adds `hasattr(torch, "accelerator")` outer guard in `_detect_device()`
  so PyTorch < 2.4 (where `torch.accelerator` module does not exist)
  does not raise AttributeError at import time

---
Co-authored-by: Claude Code <noreply@anthropic.com>
- Assertions are stripped with `python -O`; use explicit if+raise for
  required runtime guards

---
Co-authored-by: Claude Code <noreply@anthropic.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
- _MultiProcessingLauncher has no public equivalent in PTL 2.x; adds a
  comment to monitor for breakage when bumping the PTL lower bound

---
Co-authored-by: Claude Code <noreply@anthropic.com>
Borda and others added 7 commits April 8, 2026 19:13
- Old docstring said "moves the model to the target device" and "ready
  for inference", both no longer true; model is kept on CPU and moved
  lazily by _ensure_model_on_device on first use

---
Co-authored-by: Claude Code <noreply@anthropic.com>
- Satisfies static analysis requirements; function accepts duck-typed
  stand-ins, which Any correctly reflects

---
Co-authored-by: Claude Code <noreply@anthropic.com>
…n DDP

- torch.cuda.is_available() + is_bf16_supported() initialize CUDA in the
  parent; add a comment documenting this is intentional because all DDP
  paths use spawn, not fork

---
Co-authored-by: Claude Code <noreply@anthropic.com>
- Inline comment inside build_trainer() was a near-verbatim repeat of
  the module-level block; replaced with a brief cross-reference

---
Co-authored-by: Claude Code <noreply@anthropic.com>
…ect_device fallback

- test_train_auto_batch_ensures_model_on_device_before_resolve: verifies
  device placement happens before auto-batch probing (detr.py:512-516)
- test_detect_device_falls_back_when_torch_accelerator_absent: simulates
  PyTorch < 2.4 with no torch.accelerator module
- test_detect_device_falls_back_when_current_accelerator_raises: covers
  RuntimeError catch path
- test_detect_device_returns_cpu_when_no_gpu: covers CPU-only fallback

---
Co-authored-by: Claude Code <noreply@anthropic.com>
---
Co-authored-by: Claude Code <noreply@anthropic.com>
Borda and others added 4 commits April 8, 2026 19:32
Co-authored-by: Jirka Borovec <6035284+Borda@users.noreply.github.com>
- TestDetectDevice: use @patch decorator + MagicMock(spec=[]) to simulate
  missing current_accelerator without PropertyMock or class mutation
- test_train_auto_batch_ensures_model_on_device_before_resolve: convert
  to @patch decorators, drop unused tmp_path, remove spurious
  rfdetr.detr.resolve_auto_batch_config patch (local import means only
  rfdetr.training.auto_batch is the correct target), explicit side_effect
  functions replacing fragile `lambda ... or` pattern

---
Co-authored-by: Claude Code <noreply@anthropic.com>
…lly to @patch decorators

- Remove inline 'import unittest.mock as mock' from test body
- Add module-level 'from unittest.mock import MagicMock, patch'
- Three context-manager patches → three @patch decorators
- mock_trainer.side_effect replaces nested _fake_trainer closure

---
Co-authored-by: Claude Code <noreply@anthropic.com>
Borda
Borda previously approved these changes Apr 8, 2026
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 5 comments.

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Borda added 2 commits April 8, 2026 21:38
- guard private PTL launcher import with clear runtime error path
- respect explicit CPU accelerator when gating compile/pin_memory
- fix optimize_for_inference CUDA-context tests on CPU builds
- add focused regression tests for launcher compatibility and accelerator overrides

Co-authored-by: OpenAI Codex <codex@openai.com>
@Borda Borda merged commit a6a080e into roboflow:develop Apr 8, 2026
25 checks passed

Labels

bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

model.train(strategy="ddp_notebook") fails with "Cannot re-initialize CUDA in forked subprocess"

3 participants